Introduction

The following report presents the data preparation process for the Remote Working Survey, conducted in Australian state of New South Wale in August-September 2020. The cleaned-up dataset will be further investigated in the STATA statistical package to gain significant insights from the data and explore the factors influencing employees’ productivity.

First of all, the variables in the dataset will be renamed, as the existing ones are namely the list of questions, which is hard both to read and use during preparation. Moreover, variables containing last year’s expressions regarding the remote or hybrid working systems will not be taken under consideration as a big bias in results is assumed due to completely new experience for many of the workers.

data_raw <- data.table::fread("data/2020_rws.csv", data.table = FALSE) %>% 
  as_tibble() %>% 
  select(-contains("last year"))

names_new <- c(
  "ID", "YearBorn", "Gender", "Industry", "IndustryAdv", 
  "Occupation", "OccupationAdv", "CompanySize", "IsManager", "HouseHold",
  "Experience", "Transportation", "TimeRemote", "CompanyEncourageToRemote", "CompanyPreparedForRemote",
  "CompanyCommonToRemote", "CompanyEasyToRemote", "EasyCollaborateRemote", "RecommendRemote", "TimeRemotePreffered",
  "TimeRemotePrefferedAfterCovid", "CompanyEncourageToRemoteAfterCovid", "CompanySupportRemoteAfterCovid", "ChoiceToRemoteAfterCovid", "Productivity",
  "CommuteOnsiteH", "WorkOnsiteH", "PersonalOnsiteH", "DomesticOnsiteH", "CommuteRemoteH", 
  "WorkRemoteH", "PersonalRemoteH", "DomesticRemoteH", "bp1", "sp1",
  "bp2", "sp2", "bp3", "sp3", "bp4",
  "sp4", "bp5", "sp5", "bp6", "sp6",
  "bp7", "sp7", "bp8", "sp8", "bp9",
  "sp9", "bp10", "sp10", "ba1", "wa1",
  "ba2", "wa2", "ba3", "wa3", "ba4",
  "wa4", "ba5", "wa5", "ba6", "wa6"
)

df <- data_raw %>% `colnames<-`(names_new)

There are also 4 conventions assumed for new variables’ names:

  • bp columns means answers for the question From the following, please select the most significant barrier to doing your work remotely,
  • sp — for the question From the following, please select the least significant barrier to doing your work remotely,
  • ba — Compare remote working to working at your employer’s workplace. Select the best aspect of remote working for you,
  • wa — Compare remote working to working at your employer’s workplace. Select the worst aspect of remote working for you

From now on, we could continue the data investigation and preparation process.




Age

Let’s have a look at Age column, is the difference between YearBorn column and 2020:

df <- df %>% mutate(Age = 2020-YearBorn, .keep = "unused")

We notice that there are 4 significant outliers in the Age distribution. For now, we will keep 2 respondents with Age around 80 and drop only 2 answers where Age was greater than 100.

df <- df %>% filter(Age < 100)




Gender

Going further, we will take a closer look at the Gender distribution in the dataset.

As we could see, there are 200 more men than women. What is more, there are 2 answers missing; in this case they will be just excluded from the final dataset as there is no reason to consider such a category statistically significant.

df <- df %>% filter(Gender != "Rather not say")




Industry

Next, we will examine the Industry’s and IndustryAdv’s distributions.

IndustryAdv column seems to be too diversified to be used in a sample of 1500 respondents. Even the Industry column seems too rare for some categories. Therefore, we will recode the Industry column to a new one called Sector, which will be divided into 4 categories: primary and other (farming, mining, etc.), secondary (manufacturing, etc.), tertiary quinary (public services, healthcare, etc.) and tertiary quaternary (intellectual professionals).

switch_industry <- function(x) {
  switch(
    x,
    "Accommodation and Food Services" = "Tertiary quinary",
    "Administrative and Support Services" = "Tertiary quinary",
    "Agriculture, Forestry and Fishing" = "Primary and Other",
    "Arts and Recreation Services" = "Tertiary quinary",
    "Construction" = "Secondary",
    "Education and Training" = "Tertiary quinary",
    "Electricity, Gas, Water and Waste Services" = "Tertiary quinary",
    "Financial and Insurance Services" = "Tertiary quaternary",
    "Health Care and Social Assistance" = "Tertiary quinary",
    "Information Media and Telecommunications" = "Tertiary quaternary",
    "Manufacturing" = "Secondary",
    "Mining" = "Primary and Other",
    "Professional, Scientific and Technical Services" = "Tertiary quaternary",
    "Public Administration and Safety" = "Tertiary quinary",
    "Rental, Hiring and Real Estate Services" = "Tertiary quinary",
    "Retail Trade" = "Tertiary quinary",
    "Transport, Postal and Warehousing" = "Secondary",
    "Wholesale Trade" = "Tertiary quinary",
    "Other Services" = "Primary and Other",
    x
  )
}
df <- df %>% mutate(Sector = map_chr(Industry, switch_industry)) %>% select(-c(Industry, IndustryAdv))

Current dataset contains responses from people working in the tertiary sector mainly, which seems quite obvious as it is hard to imagine a farmer working 100% remotely. Considering that, it might be unnecessary to use this variable as a predictor.




Occupation

We will proceed with analyzing the followng columns: Occupation, OccupationAdv, IsManager

## Warning: Ignoring 135 observations

Unfortunately, as IsManager column contained missing observations, it won’t be used as a regressor. Furthermore, the OccupationAdv column will be dropped, just as the IndustryAdv. Finally, all the minor categories for the Occupation variable will be renamed to ‘Other’

switch_occupation <- function(x) {
  switch(
    x,
    "Community and personal service" = "Other",
    "Labourers" = "Other",
    "Machinery operators and drivers" = "Other",
    "Sales" = "Other",
    "Technicians and trades" = "Other",
    "Professionals" = "Professional",
    "Managers" = "Manager",
    "Clerical and administrative" = "Clerk",
    x
  )
}

df <- df %>% mutate(Occupation = map_chr(Occupation, switch_occupation)) %>% select(-c(IsManager, OccupationAdv))




Company size, household, experience & transportation

Next, we will investigate the CompanySize, HouseHold, Experience, Transportation columns.

Transportation and Experience columns will be kept as they are. In case of CompanySize, though, it will be restricted to 3 categories: 1-19, 20-199 and 200+. As for the HouseHold variable, it will contain the couple+children, couple, single and other levels to avoid unnecessary sparsity.

switch_household <- function(x) {
  switch(
    x,
    "Couple with dependent children" = "Couple+children",
    "Couple with no dependent children" = "Couple",
    "Single person" = "Single",
    "Other"
  )
}

switch_size <- function(x) {
  switch(
    x,
    "Between 1 and 4" = "1-19",
    "Between 20 and 199" = "20-199",
    "Between 5 and 19" = "1-19",
    "I am the only employee" = "1-19",
    "More than 200" = "200+",
    x
  )
}

df <- df %>% 
  mutate(
    CompanySize = map_chr(CompanySize, switch_size),
    HouseHold = map_chr(HouseHold, switch_household),
  )




Working system

Next, we will explore variables related to actual time being spent remote and the preferred amount of remote work in hours. The AfterCovid variable will be used to refer to a variable expressing people’s preferences, as the further analysis will concentrate on the after-covid times.

df <- df %>% select(-TimeRemotePreffered)

As we are mostly interested in the remote and hybrid work, we will throw away respondents who spend (almost) all their working hours onsite. Moreover, we will assume that people working remotely in (less than) 10% cases will be classified as ‘almost onsite’, from 10 to 90 - working ‘hybrid’, and those working 100% will be called ‘Remote’ workers.




Companies’ and employees’ atttude

Regarding the CompanyEncourageToRemote and CompanyPreparedForRemote, just the same as it was with the TimePrefferedAfterCovid, they will be dropped as we are interested in the long-term preferences. As for the EasyCollaborateRemote column, it will not be used throughout the research, as it is correlated with the sp and pb columns, which will be examined later. As for the RecommendRemote and ChooseRemote columns, they will be restricted to ‘yes’|‘no’|‘hard to say’ categories to avoid small groups division later during the analysis and model creation.

switch_answer <- function(x) {
  switch(
    x,
    "Strongly disagree" = "no",
    "Somewhat disagree" = "no",
    "Neither agree nor disagree" = "hard to say",
    "Somewhat agree" = "yes",
    "Strongly agree" = "yes", 
  x)
}

df <- df %>% 
  select(-c(CompanyEncourageToRemote, CompanyPreparedForRemote, EasyCollaborateRemote)) %>% 
  mutate(
    RecommendRemote = map_chr(RecommendRemote, switch_answer),
    ChooseRemote = map_chr(ChoiceToRemoteAfterCovid, switch_answer),
    .keep = "unused"
  )

Going further, as CommonToRemote and EasyToRemote are pretty strongly correlated (see code output), we will use only CommonToRemote one. The same story goes with CompanyEncourageToRemoteAfterCovid and CompanySupportRemoteAfterCovid as it is naturally hard to imagine encouraging remote work without supporting it.

df <- df %>% 
  mutate(
    EasyRemote = map_chr(CompanyEasyToRemote, switch_answer),
    CommonRemote = map_chr(CompanyCommonToRemote, switch_answer),
    EncourageRemote = map_chr(CompanyEncourageToRemoteAfterCovid, switch_answer),
    SupportRemote = map_chr(CompanySupportRemoteAfterCovid, switch_answer),
    .keep = "unused"
  )
df_temp1 <- df %>% filter(EasyRemote!="hard to say", CommonRemote!="hard to say")
df_temp2 <- df %>% filter(EncourageRemote!="hard to say", SupportRemote!="hard to say")
print(round(sum(df_temp1$EasyRemote != df_temp1$CommonRemote)/nrow(df_temp1), 4)*100)
## [1] 8.31
print(round(sum(df_temp2$EncourageRemote != df_temp2$SupportRemote)/nrow(df_temp2), 4)*100)
## [1] 5.13
df <- df %>% select(-c(EasyRemote, EncourageRemote))




Daily activities’ distributions

Next, we continue the analysis for activities-related variables. Firstly, we will replace all the zero variables with 0.01. This will be needed to take into account cases, when the difference between onsite and remote hours spent would be very noticeable. Secondly, we will exclude observations with total hours spent on mentioned activities greater than 24.

df <- df %>% 
  mutate(
    WorkOnsiteH = ifelse(WorkOnsiteH == 0, 0.01, WorkOnsiteH),
    PersonalOnsiteH = ifelse(PersonalOnsiteH == 0, 0.01, PersonalOnsiteH),
    DomesticOnsiteH = ifelse(DomesticOnsiteH == 0, 0.01, DomesticOnsiteH),
    CommuteOnsiteH = ifelse(CommuteOnsiteH == 0, 0.01, DomesticOnsiteH),
    WorkRemoteH = ifelse(WorkRemoteH == 0, 0.01, WorkRemoteH),
    PersonalRemoteH = ifelse(PersonalRemoteH == 0, 0.01, PersonalRemoteH),
    DomesticRemoteH = ifelse(DomesticRemoteH == 0, 0.01, DomesticRemoteH),
    CommuteRemoteH = ifelse(CommuteRemoteH == 0, 0.01, CommuteRemoteH),
  ) %>% 
  mutate(
    WorkingHoursDiff = round((WorkRemoteH-WorkOnsiteH)/WorkOnsiteH, 4)*100,
    PersonalHoursDiff = round((PersonalRemoteH-PersonalOnsiteH)/PersonalOnsiteH, 4)*100,
    DomesticHoursDiff = round((DomesticRemoteH-DomesticOnsiteH)/DomesticOnsiteH, 4)*100,
    CommuteHoursDiff = round((CommuteRemoteH-CommuteOnsiteH)/CommuteOnsiteH, 4)*100,
    TotalHoursOnsite = WorkOnsiteH+CommuteOnsiteH+DomesticOnsiteH+PersonalOnsiteH,
    TotalHoursRemote = WorkRemoteH+CommuteRemoteH+DomesticRemoteH+PersonalRemoteH,
    .keep = "unused"
  ) %>% 
  mutate(
    WorkingHoursDiff = ifelse(is.na(WorkingHoursDiff), 0, WorkingHoursDiff),
    PersonalHoursDiff = ifelse(is.na(PersonalHoursDiff), 0, PersonalHoursDiff),
    DomesticHoursDiff = ifelse(is.na(DomesticHoursDiff), 0, DomesticHoursDiff),
    CommuteHoursDiff = ifelse(is.na(CommuteHoursDiff), 0, DomesticHoursDiff),
  ) %>% 
  filter(TotalHoursOnsite <= 24, TotalHoursRemote<=24, Age < 100) %>% 
  select(-c(TotalHoursRemote, TotalHoursOnsite))

Looking at the densities of mentioned variables, it seems like a value replacement is needed to be used. Therefore, values greater than 100 or lower than -100 will be recoded as follows:

replace_diff <- function(x) {ifelse(x>100, 100, x) %>% ifelse(.< -100, -100, .)}

df <- df %>% 
  mutate(across(c("WorkingHoursDiff", "PersonalHoursDiff", "DomesticHoursDiff", "CommuteHoursDiff"), replace_diff))

However, it is still hard to see any dependency, as the biggest bin takes both negative and positive values. Therefore, we will postpone the deeper analysis of the mentioned variables for later.




Biggest challenge

One of the hardest variable in terms of data preparation are the biggest challenge, faced during the remote work period. It will be calculated as follows:

  1. All the unique answers for questions bp1-bp15 will be counted for each individual;
  2. All the unique answers for questions sp1-sp15 will be counted for each individual;
  3. The aforementioned tables will be joined, and the counts for each unique answers for each individual will be subtracted;
  4. 0 values will be replaced with ‘No significant challenge’;
  5. Negative values will be replaced with ‘No specified challenge’;
  6. The most popular challenge, though, will be classified as the biggest challenge faced.
more_significant_tbl <- df %>%
  select(ID, starts_with("bp")) %>% 
  gather(key = "column", value = "value", -ID) %>%
  group_by(value, ID) %>%
  summarise(
    n_max = n()
  ) %>% 
  ungroup() %>% 
  arrange(ID, desc(n_max))
less_significant_tbl <- df %>%
  select(ID, starts_with("sp")) %>% 
  gather(key = "column", value = "value", -ID) %>%
  group_by(value, ID) %>%
  summarise(
    n_min = n()
  ) %>% 
  ungroup() %>% 
  arrange(ID, desc(n_min))

problems_tbl <- more_significant_tbl %>% 
  full_join(less_significant_tbl, by = c("ID", "value")) %>% 
  mutate(
    n_min = ifelse(is.na(n_min), 0, n_min),
    n_max = ifelse(is.na(n_max), 0, n_max),
    n = n_max - n_min
  ) %>% 
  select(-c(n_max, n_min)) %>% 
  arrange(ID, desc(n)) %>% 
  mutate(
    value = ifelse(n<0, "No significant challenge", value),
    value = ifelse(n==0, "No specified challenge", value)
  ) %>% 
  group_by(ID) %>%
  rename(nbr = n) %>% 
  top_n(n = 1) %>%
  mutate(rank = paste0("top", row_number())) %>%
  filter(rank %in% c("top1")) %>% 
  ungroup() %>%
  mutate(
    BiggestChallenge = ifelse(rank == "top1", value, NA),
  ) %>% 
  select(ID, BiggestChallenge)


df <- df %>% select(-c(starts_with("bp"), starts_with("sp"))) %>% left_join(problems_tbl, by = "ID")

As we could see, the resulted BiggestChallenge column is too sparse. Therefore, we will combine all the IT/equipment issues into single one called ‘Technical’. We will also unite ‘my workspace’ and ‘my living situation’ answers into single one. The other options will be just shortened to more human-like variants.

switch_challenge <- function(x) {
  switch(
    x,
    "Connectivity (internet connection)" = "Technical",
    "Feeling left out and/or isolated" = "Isolation",
    "My organisation's software and systems" = "Technical",
    "My workspace (e.g. suitable chair, lighting, noise levels, facilities)" = "Working conditions",
    "I have tasks that can't be done remotely" = "Impossible tasks",
    "IT equipment (computer, printer, etc.)" = "Technical",
    "Difficulty collaborating remotely" = "Collaboration",
    "Caring responsibilities" = "Caring",
    "Cyber security" = "Technical",
    "Lack of motivation" = "Motivation",
    "My living situation (e.g. location, home size, who I live with)" = "Working conditions",
    "Lack of remote working skills" = "Self management",
    "Poor management" = "Self management",
    "Management discourages remote working" = "Company's policy",
    x
  )
}
df <- df %>% mutate(BiggestChallenge = map_chr(BiggestChallenge, switch_challenge))




Biggest benefit

Almost the same procedure will be applied to the biggest benefit variable with the only difference in final category’s names.

more_significant_tbl <- df %>%
  select(ID, starts_with("ba")) %>% 
  gather(key = "column", value = "value", -ID) %>%
  group_by(value, ID) %>%
  summarise(
    n_max = n()
  ) %>% 
  ungroup() %>% 
  arrange(ID, desc(n_max))
less_significant_tbl <- df %>%
  select(ID, starts_with("wa")) %>% 
  gather(key = "column", value = "value", -ID) %>%
  group_by(value, ID) %>%
  summarise(
    n_min = n()
  ) %>% 
  ungroup() %>% 
  arrange(ID, desc(n_min))

changes_tbl <- more_significant_tbl %>% 
  full_join(less_significant_tbl, by = c("ID", "value")) %>% 
  mutate(
    n_min = ifelse(is.na(n_min), 0, n_min),
    n_max = ifelse(is.na(n_max), 0, n_max),
    n = n_max - n_min
  ) %>% 
  select(-c(n_max, n_min)) %>% 
  arrange(ID, desc(n)) %>% 
  mutate(
    value = ifelse(n<0, "No significant challenge", value),
    value = ifelse(n==0, "No specified challenge", value)
  ) %>% 
  group_by(ID) %>%
  rename(nbr = n) %>% 
  top_n(n = 3) %>%
  mutate(rank = paste0("top", row_number())) %>%
  filter(rank %in% c("top1")) %>% 
  ungroup() %>%
  mutate(
    BiggestBenefit = ifelse(rank == "top1", value, NA),
  ) %>% 
  select(ID, BiggestBenefit)

df <- df %>% select(-c(starts_with("wa"), starts_with("ba"))) %>% left_join(changes_tbl, by = "ID")

As in case of BiggestChallenge column, the number of categories will be cut down. Ability to socialize and other relationship variables will be united into a single group. Commuting, on-learning opportunities will be united into category ‘Professional’. Family responsibilities will be included as part of work-life balance. The left 3 categories will be left unchanged.

switch_benefit <- function(x) {
  switch(
    x,
    "Managing my family responsibilities" = "Work-life balance",
    "Managing my personal commitments" = "Professional",
    "My daily expenses" = "Less expenses",
    "My job satisfaction" = "Job satisfaction",
    "My mental wellbeing" = "Mental wellbeing",
    "My on-the-job learning opportunities" = "Professional",
    "My opportunities to socialise" = "Relationships",
    "My personal relationships" = "Relationships",
    "My work-life balance" = "Work-life balance",
    "My working relationships" = "Relationships",
    "Preparing for work and commuting" = "Professional",
    "The number of hours  I work" = "Work-life balance",
    x
  )
}

df <- df %>% mutate(BiggestBenefit = map_chr(BiggestBenefit, switch_benefit))




Productivity

Last but not least is the Productivity variable which will be the core interest in the further research.

We can’t but mention that there are much less people claiming working less productive by 40% and 50% percent. It might be also related with the fear of being suspected in working not effectively and being fired, as the productivity was measured by the respondents, not third parties. Respondents might also overestimate their productivity to make impression, especially if we are talking about small companies (ex. 1-4 people), where it will be pretty obvious which answers belong to whom, regarding the animosity of such a survey. Nevertheless, this aspect will be discussed further during the very research.

The study will concentrate on the very fact, if the change in productivity occurred at all, and in what direction. As we will not be interested in the degree of that change, we will restrict the variable to only 3 possible options: dropped, increased or stay the same.

df <- df %>% 
  mutate(
    Productivity = ifelse(str_detect(Productivity, "less"), "Dropped", ifelse(str_detect(Productivity, "more"), "Increased", "Same level"))
  )

Finalization

In this way, we cleared dataset to up to 1250 observations with 22 potential regressors. The dataset will be saved to be used later in STATA.

df %>% write_csv("data/remote_productivity.csv")

print(df %>% ncol())
## [1] 22
print(df %>% nrow())
## [1] 1250